A complete theoretical framework for inferring horizontal gene transfers using partial order sets

We present a method for detecting horizontal gene transfer (HGT) using partial orders (posets). The method requires a poset for each species/gene pair, where we have a set of species S, and a set of genes G. Given the posets, the method constructs a phylogenetic tree that is compatible with the set of posets; this is done for each gene. Also, the set of posets can be derived from the tree. The trees constructed for each gene are then compared and tested for contradicting information, where a contradiction suggests HGT.


Introduction
Most work in evolutionary genomics has focused on vertical gene transfer from one species to a lineal descendant. Much recent work has been directed towards the phenomenon of horizontal gene transfer (HGT) [1]. Because of the impact of HGTs on the ecological and pathogenic character of genomes, algorithms are sought that can computationally determine which genes of a given genome are products of HGT events. Numerous strategies have employed the nucleotide composition of coding sequences to predict HGT. Earlier methods flagged genes with atypical G + C content. Other methods used codon usage patterns to predict HGT. Many models also used nucleotide patterns as a genomic signature; these models have been analyzed using sliding windows, Bayesian classifiers, Markov models, and support vector machines. While no previous work uses partial orders to investigate HGT, we summarize computational research on detecting HGT in the Related literature section below.
Suppose that we have complete, annotated genomes for m species. Further, suppose that we have selected a set of n genes, from some reference genome or otherwise, for analysis. If we know the relative distances between each pair of species per gene, then we have a set of partial orders defining the relative relationship among species that can be used to identify which genes are candidates for HGT. Given a poset for each gene, a tree corresponding to that gene is constructed; different trees suggest genes that are candidates for HGT. Once HGT is indicated, additional time-related information can be brought to bear to determine the relative order of events and to establish direction. In fact, our algorithm predicts direction, as illustrated in Fig 1.
More precisely, let the species be s_1, s_2, ..., s_m and let the selected genes be g_1, g_2, ..., g_n. Standard methods for obtaining the set of genes, such as the one in Lake and Rivera [2], can be followed. BLASTing gene g_k in species s_i against a database of genes from all m species, we obtain a bit score B(g_k; s_i, s_j) of a best alignment of that gene against the same gene in species s_j. If g_k is not found in s_j, then set B(g_k; s_i, s_j) = 0. In general, the higher B(g_k; s_i, s_j) is, the better the match between gene g_k in species s_i and gene g_k in species s_j. There is no need to take special notice of an absent gene, since B(g_k; s_i, s_j) = 0 is a meaningful substitute for a Boolean value representing presence or absence of a gene.
There is another quantity associated with the (g_k, s_i, s_j) triple. Define T(g_k; s_i, s_j) to be the true evolutionary distance in time between the g_k gene of s_i and the g_k gene of s_j, that is, the distance determined by what actually happened during the evolution of the gene. For example, if the most recent common ancestor of the two genes existed 20 million years ago, then T(g_k; s_i, s_j) is 40 million years. While these T(g_k; s_i, s_j) values cannot be measured directly, either absolute or relative values for times can be estimated using probabilistic models.
The B(g_k; s_i, s_j) values are not random. In fact, a ranking of the B(g_k; s_i, s_j) values for 1 ≤ j ≤ m should roughly match a ranking of the T(g_k; s_i, s_j) values from the s_i gene g_k to all the other copies of g_k. In the absence of HGT or other horizontal evolutionary events, we must have T(g_k; s_i, s_j) = T(g_ℓ; s_i, s_j) for every pair of genes g_k and g_ℓ. Therefore, we expect that the rankings of the B(g_k; s_i, s_j) and B(g_ℓ; s_i, s_j) values will be similar in ways we want to explore. And, under reasonable assumptions, the distribution of relative distances should be consistent with predictions of coalescent theory. In particular, as evolutionary distances increase, there will typically be multiple genes that have the same T value from the g_k gene in species s_i. Moreover, the probability that two evolutionary events occur at the same instant in time is 0.
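The ranking idea above can be illustrated with a small sketch. It assumes that the bit scores B(g_k; s_i, s_j) for one gene and one reference species s_i are already available as a dictionary. The helper name `poset_levels`, the tolerance parameter `tol`, and the decision to place near-equal scores on the same level are assumptions made for illustration; the text does not prescribe how ties are handled.

```python
def poset_levels(scores, ref, tol=0.0):
    """Group species into levels of decreasing bit score relative to `ref`.

    scores: dict mapping species name -> B(g_k; ref, species); 0 means absent.
    Returns a list of sets; level 0 is {ref}, later levels are farther away.
    Species whose scores differ by at most `tol` from the previous one
    share a level (an illustrative assumption, not part of the paper).
    """
    levels = [{ref}]
    prev = None
    others = sorted((s for s in scores if s != ref), key=lambda s: -scores[s])
    for s in others:
        if prev is not None and prev - scores[s] <= tol:
            levels[-1].add(s)      # tie with the previous species
        else:
            levels.append({s})     # strictly lower score: new, farther level
        prev = scores[s]
    return levels


# Hypothetical scores for one gene, with s1 as the reference species.
example_scores = {"s1": 512.0, "s2": 430.0, "s3": 301.0, "s4": 301.0, "s5": 0.0}
example_levels = poset_levels(example_scores, "s1")
```

Each level in the result can be read as one rank of an s_i-poset: species on a lower level are closer to the reference than species on a higher level, and species on the same level are left incomparable.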
In the presence of horizontal evolutionary events, the patterns of rankings of the B and T values will be different for different genes, depending on which horizontal events each gene is involved in. Two genes that are involved in exactly the same horizontal events will have identical patterns in their T values and similar patterns in their B values.
If we use the rankings of the B values as an approximate substitute for the rankings of the unknown T values, then the rankings can be compared and clustered to identify groups of genes that participated in the same horizontal events. Fix a gene g k . Then there is a gene g k tree that represents the true evolutionary history of the g k 's in all the species. It is rooted at the most recent common ancestor of the m species. Our first goal is to define a computational problem to achieve this clustering and to design an efficient algorithm to solve the problem. In the following, proofs of results are elaborated. Note that Belal and Heath [3] is an earlier five-page announcement of these results.

Definitions
For a rooted (directed) tree T, let R(T) be the root of T, let I(T) be the set of internal nodes of T, and let L(T) be the set of leaves of T.
Let S be a finite set of species. An S-tree T = (V, E) is a rooted tree such that every internal node has outdegree at least two, together with a bijective labeling function λ: L(T) → S. In particular, every S-tree has precisely |S| leaves. Fig 2 illustrates an S-tree for the case n ≥ 2, where there is only one internal node, the root r = R(T). There are n leaves x_1, x_2, ..., x_n, and λ(x_i) = s_i.
Let T = (V, E) be an S-tree. Let u ∈ V. The subtree rooted at u is T(u). The species set S(u) for u is the set of leaf labels in T(u).
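As a minimal sketch of these definitions, the following code represents an S-tree with a hypothetical `Node` class, checks the two defining conditions (outdegree at least two for internal nodes, and a bijective leaf labeling onto S), and computes the species set S(u). The class and function names are illustrative and are not taken from the paper.

```python
class Node:
    """A node of a rooted tree; leaves carry a species label."""

    def __init__(self, label=None, children=()):
        self.label = label
        self.children = list(children)


def species_set(u):
    """S(u): the set of leaf labels in the subtree T(u)."""
    if not u.children:
        return {u.label}
    return set().union(*(species_set(c) for c in u.children))


def is_s_tree(root, S):
    """True if internal nodes have outdegree >= 2 and leaf labels biject onto S."""
    labels = []

    def visit(u):
        if not u.children:
            labels.append(u.label)
            return True
        return len(u.children) >= 2 and all(visit(c) for c in u.children)

    return visit(root) and len(labels) == len(set(labels)) and set(labels) == set(S)


# A three-species tree: y is the parent of s1 and s2; y and s3 are children of the root.
y = Node(children=[Node("s1"), Node("s2")])
root = Node(children=[y, Node("s3")])
```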
Let T be an S-tree with an internal node x that has three or more children. A refinement step (on T at x) adds an internal node y to the tree T, where y is the parent of a proper subset of the children of x and y is a new child of x. An S-tree T′ is a refinement of T if T′ can be obtained by performing zero or more refinement steps on T. For example, in Fig 4, T_2 is a refinement of T_1 by a refinement step on T_1 at r. The refinement step applied adds one internal node y, which is the parent of s_1 and s_2 in T_2; y and s_3 are the direct children of r in T_2.
Let X = {X_1, X_2} and Y = {Y_1, Y_2} be two partitions of S. Call such partitions with two elements each 2-partitions. Note that the deletion of an edge from an S-tree induces two connected subtrees and, hence, a 2-partition of S. X and Y are contradicting partitions if there exist four species s_1, s_2, s_3, s_4 such that s_1, s_2 ∈ X_1, s_3, s_4 ∈ X_2, s_1, s_3 ∈ Y_1, and s_2, s_4 ∈ Y_2. Two S-trees T_1 and T_2 are contradictory if there exists an edge in T_1 and an edge in T_2 such that their induced 2-partitions are contradicting.
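The four-species condition can be tested directly. The definition asks for one species in each of X_1 ∩ Y_1, X_1 ∩ Y_2, X_2 ∩ Y_1, and X_2 ∩ Y_2, so two 2-partitions are contradicting exactly when all four intersections are non-empty. The sketch below encodes that observation; the function name `contradicting` is illustrative.

```python
def contradicting(X, Y):
    """True if the 2-partitions X = (X1, X2) and Y = (Y1, Y2) are contradicting.

    Each block is a set of species. The definition requires witnesses
    s1 in X1&Y1, s2 in X1&Y2, s3 in X2&Y1, s4 in X2&Y2, i.e. all four
    pairwise intersections must be non-empty.
    """
    return all(a & b for a in X for b in Y)


X = ({"s1", "s2"}, {"s3", "s4"})
Y_bad = ({"s1", "s3"}, {"s2", "s4"})   # splits both blocks of X
Y_ok = ({"s1"}, {"s2", "s3", "s4"})    # nested in a block of X
```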
Let u, v ∈ L(T), for some S-tree T. The most recent common ancestor MRCA(u, v) of u and v is the node w that is a common ancestor of u and v such that T(w) is the smallest rooted subtree in T containing both u and v.
A partial order is a binary relation ≤ over a set S that is reflexive, antisymmetric, and transitive, i.e., for all a, b, c ∈ S, we have that
• a ≤ a (reflexivity);
• if a ≤ b and b ≤ a then a = b (antisymmetry); and
• if a ≤ b and b ≤ c then a ≤ c (transitivity).
A set with a partial order is a partially ordered set or a poset. If (S, ≤) is a poset and a, b ∈ S, then a < b if and only if a ≤ b and a ≠ b. Note that a < b is transitive. The directed graph G = (S, <) is clearly a directed acyclic graph (DAG). The transitive reduction of G is the DAG on node set S that contains those edges (a, b) such that there is no c ∈ S satisfying a < c < b. A Hasse diagram of < (which is also a Hasse diagram of ≤) is a drawing of the transitive reduction of (S, <) such that no arrows are included. An example of a Hasse diagram is shown in Fig 5.
Let s_i ∈ S be a species. An s_i-poset P = (S, ≤_i) is a poset with the property that, for every s_j ∈ S, we have s_i ≤_i s_j. In other words, s_i is the unique minimum element of P.
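A short sketch may help fix these definitions. It stores the strict order < as a set of pairs (a, b), assumed to be transitively closed, computes the transitive reduction that a Hasse diagram draws, and checks the minimum-element property of an s_i-poset. The function names are illustrative, and the example relation is the poset P_4 = {(s_3, s_1), (s_3, s_2), (s_1, s_2)} used later in this section.

```python
def transitive_reduction(less):
    """Edges (a, b) of the strict order with no c such that a < c < b.

    less: set of pairs (a, b) meaning a < b, assumed transitively closed.
    """
    elems = {e for pair in less for e in pair}
    return {
        (a, b)
        for (a, b) in less
        if not any((a, c) in less and (c, b) in less for c in elems)
    }


def is_si_poset(less, si, S):
    """True if si is strictly below every other species of S."""
    return all((si, sj) in less for sj in S if sj != si)


P4 = {("s3", "s1"), ("s3", "s2"), ("s1", "s2")}
hasse_edges = transitive_reduction(P4)
```

Here the pair (s_3, s_2) is dropped from the reduction because s_3 < s_1 < s_2, which leaves the chain s_3, s_1, s_2.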
The s_i-poset P_i = (S, ≤_i) is compatible with S-tree T if, for all distinct triples x, y, z ∈ L(T) such that λ(x) = s_i, λ(y) = s_j, and λ(z) = s_k and such that s_j ≤_i s_k, the shortest path from either of x or y to z passes through MRCA(x, y). Given the tree shown in Fig 6, Fig 7 shows an example of a poset that is compatible with the tree, while Fig 8 shows an incompatible poset: the poset indicates that s_3 is the closest species to s_1, while, in the tree, the closest species to s_1 is s_2.
Let P = {P_1, P_2, ..., P_n | P_i is an s_i-poset} be a set of posets. P is consistent if, for all posets P_i, P_j ∈ P, whenever s_j ≤_i s_k, then s_i ≤_j s_k. For example, let P_4 = {(s_3, s_1), (s_3, s_2), (s_1, s_2)}; then {P_1, P_2, P_4} is inconsistent, since P_1 and P_2 indicate that s_1 and s_2 are closer to each other than to s_3, while P_4 indicates that s_1 is closer to s_3 than to s_2.

Related literature
Among the methods for detecting HGT addressed by many researchers is conditioned reconstruction. Conditioned reconstruction (CR) is a phylogenetic technique that utilizes gene absence/presence data to reconstruct phylogenetic relationships [4]. CR [2] compares one genome to another and, according to whether a gene ortholog is present or absent, assigns a P or A character state. The probability of a state transition is analyzed using Markov models. Given two genomes, X and Y, four patterns are possible: PP, PA, AP, and AA. Many questions were raised on how to count the pattern AA: how can one identify genes that are missing from both genomes X and Y? To solve this problem, CR uses a conditioning genome as a reference for which genes are to be considered. A gene has to be present in both the conditioning genome and the genome being coded in order to be considered present. An absent gene is present in the conditioning genome and absent from the genome under study. The conditioning genome has a large effect on the results obtained, as it represents the full set of orthologous genes coded during matrix development. In our approach, we avoid building our results on a conditioning genome, or any other input that would bias our results. However, the approach we present is similar to CR in the problem addressed and in the use of information about all genes in the genomes. Bailey et al. [4] argue that CR cannot be used to distinguish between HGT and genome fusion. They suggest some refinements that make CR perform better. Bapteste and Walsh [5] question the ring of life hypothesis of Lake and Rivera [2]. They claim that it is not possible to reconstruct the ring of life in the presence of HGT. Bapteste and Walsh [5] see the conditioning genome (CG) as more of a tool than a biological concept; this genome can exist anywhere in the tree of life and cannot be used in evolutionary reconstruction. See Belal [6] for additional discussion of CR. Related methods are found in [7][8][9][10][11].
Other methods for detecting horizontal gene transfer have been proposed by multiple researchers. Podell and Gaasterland [12] present the DarkHorse method for detecting HGT. They define the lineage probability index (LPI) to measure HGT and species closeness. This measure relies on lineage key terms. The higher the LPI score for an organism, the closer it is to the query (reference) genome. Groups of closely related organisms have similar LPI scores. Xiang et al. [13] apply DarkHorse in analyzing the evolutionary relationship between Microsporidia and Fungi.
Moreover, phylogenetic reconstruction research has contributed to solving many evolutionary problems. Nakhleh et al. [14] present a method for reconstructing phylogenetic networks using maximum parsimony. Their method is then studied and applied in [15]. Other network-based methods are found in [16][17][18][19][20]. For example, Cardona, Pons, and Rosselló [17] investigate LGT (lateral gene transfer) networks that combine a principal rooted subtree with a set of additional edges representing LGT. They present an efficient algorithm for constructing an LGT network from a set of phylogenetic trees.
Snir and Trifonov [21] present a method for detecting HGT. Their algorithm takes two genomes with their lengths and calculates the expectancy of each identical region's length to obtain a measure of confidence as to exceptional similarity. Abby et al. [22] present a program called Prunier for the detection of HGT. The program searches for a maximum statistical agreement forest between a gene tree and a reference tree. Adato et al. [23] provide an algorithm for detecting HGT based on gene synteny and the concept of constant relative mutability. Scornavacca et al. [24] provide an algorithm for detecting HGT in some alternative cases. Sanchez-Soto et al. [25] introduce the algorithm ShadowCaster for HGT detection in prokaryotes.
Some researchers combine HGT with other evolutionary phenomena. Bansal et al. [26] develop the tool RANGER-DTL to detect gene duplication, transfer, and loss. Van Iersel et al. [27] develop a polynomial-time algorithm for some cases of HGT detection. Hasic and Tannier [28] present NP-hard cases for HGT detection.
In addition to the above, there are a number of theoretical approaches to problems related to HGT: [28][29][30][31]. These are typically mathematically oriented methodologies for reconstructing a species tree or reconciling gene and species trees.
Also worth discussing is reticulate evolution. According to [32], there are numerous reticulations among related species, especially in insects, vertebrates, microbes, and plants. In [33], extensions of Wayne Maddison's approach are presented for reconstructing reticulate evolution that results from horizontal transfer or hybrid speciation. Two polynomial-time algorithms are presented that outperform both NeighborNet and Maddison's method. Moreover, [34] gives a review of the mathematical techniques used to construct phylogenies and reticulate evolution. Different methods are discussed, among which are distance-based, maximum parsimony, and maximum likelihood methods. In [35], the problem of approximating a dissimilarity matrix using a reticulogram is discussed; the reticulogram is obtained by adding edges to an additive tree, which improves the approximation of the dissimilarity matrix. As stated in [36], horizontal gene transfer (HGT) is one of the most important events in evolution, and the authors describe a new polynomial-time algorithm to infer HGT events. The algorithm uses least squares (LS), Robinson and Foulds (RF) distance, quartet distance (QD), and bipartition dissimilarity (BD). The results show that bipartition dissimilarity gives the best results. Also, in [37] a novel heuristic technique for HGT detection was tested on both simulated and real data. The technique was found to provide greater sensitivity than other HGT techniques. The proposed technique also considers the lengths of the genes being transferred.
In [38] a number of operons have been identified experimentally by sequence similarity analysis and then by phylogenetic analysis. Many occurrences of horizontal transfer of entire operons were detected.
Mosaic genes have been discussed in [39]. A mosaic gene is composed of alternating sequence polymorphisms either belonging to the host original allele or derived from the integrated donor DNA. In this paper, the authors propose a method for detecting partial HGT events and related intragenic recombination giving rise to the formation of mosaic genes.

Constructing an S-tree from a set of posets
Recall the definition of compatible from the Definitions section. The s_i-poset P_i = (S, ≤_i) is compatible with S-tree T if, for all distinct triples x, y, z ∈ L(T) such that λ(x) = s_i, λ(y) = s_j, and λ(z) = s_k and such that s_j ≤_i s_k, the shortest path from either of x or y to z passes through MRCA(x, y).
The problem of constructing a tree is defined as follows:
INSTANCE: A set S of n species and posets P_1, P_2, ..., P_n, where each P_i is an s_i-poset.
SOLUTION: An S-tree T compatible with P_1, P_2, ..., P_n, if one exists.
Theorem 1. Let P be a set of posets that is compatible with an S-tree T. Let T′ be a refinement of T. Then P is compatible with T′.
Proof. The proof is by induction on the number of refinement steps, k, needed to obtain T′ from T. For the base case of the induction, assume that k = 0. Then T′ = T, and, therefore, P is clearly compatible with T′. Now assume that k ≥ 1 and that the result holds for k − 1 refinement steps. Then there exists an S-tree T″ such that T″ is obtained by k − 1 refinement steps from T and T′ is obtained from T″ in one refinement step. Let u in T″ have children v_1, v_2, ..., v_p, and let the refinement step add a new node w as the parent of v_1, v_2, ..., v_q. Note that q ≥ 2 and p − q ≥ 1. For P to be compatible with T′, the compatibility condition must hold, that is:
• For all distinct triples x, y, z ∈ L(T) such that λ(x) = s_i, λ(y) = s_j, and λ(z) = s_k and such that s_j ≤_i s_k, there is a shortest path from either of x or y to z passing through MRCA(x, y).
By applying the compatibility condition to T″, the cases for x, y, and z are as follows:
• x, y ∈ v_1, v_2, ..., v_p. Therefore, MRCA(x, y) is u, and the shortest path from either of x or y to z passes through u.
• Otherwise, since s_j ≤_i s_k, there exists an MRCA for x and y such that the shortest path from either of x or y to z passes through MRCA(x, y).
Similarly, by applying the compatibility condition to T′, the cases for x, y, and z are as follows:
• x, y ∈ v_1, v_2, ..., v_q. Therefore, MRCA(x, y) is w, and the shortest path from either of x or y to z passes through w.
• x, y ∈ v_{q+1}, ..., v_p. Therefore, MRCA(x, y) is u, and the shortest path from either of x or y to z passes through u.
• x ∈ v_1, v_2, ..., v_q and y ∈ v_{q+1}, ..., v_p. Therefore, MRCA(x, y) is u, and the shortest path from either of x or y to z passes through u.
• y ∈ v_1, v_2, ..., v_q and x ∈ v_{q+1}, ..., v_p. Therefore, MRCA(x, y) is u, and the shortest path from either of x or y to z passes through u.
• Otherwise, since s_j ≤_i s_k, there exists an MRCA for x and y such that the shortest path from either of x or y to z passes through MRCA(x, y).
Therefore, if the compatibility condition holds for T″, and T′ is obtained using one refinement step from T″, then the compatibility condition also holds for T′.
By induction, P is compatible with T′, as required.
Now we present a data structure that the algorithm uses to identify siblings. For the set of posets P, a matrix A of size n × n is defined. We define A(i, j) to be the number of species s_x such that s_j is strictly less than s_x in the poset (S, ≤_i).
Theorem 2. Let P be a set of posets, and let A be the matrix representing P. If P is consistent, then A is symmetric.
Proof. Let P = {P_1, P_2, ..., P_n | P_i is an s_i-poset} be a set of posets. P is consistent if, for all posets P_i, P_j ∈ P, whenever s_j ≤_i s_k, then s_i ≤_j s_k.
This A matrix represents an undirected graph, where siblings are indicated by cliques in the graph: for a species s_i, all other species connected to s_i by edges with equal labels are siblings. Higher values indicate siblings at lower levels in the tree; in other words, the maximum value indicates leaf siblings. Note that if the posets contain missing or incorrect information, the algorithm will not be able to construct a tree for the gene corresponding to that set of posets.
The following example illustrates the defined data structures. Consider the set of posets P whose corresponding matrix A is shown in Table 1. The graph G that is represented by the matrix A given in Table 1 is shown in Fig 9, where s_1, s_2, and s_3 are siblings, and their parent and s_4 are both children of the root.
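The matrix can be computed directly from the posets. The sketch below reads A(i, j) as the number of species lying strictly above s_j in the s_i-poset, and stores each poset as a set of strict pairs (a, b) meaning a < b. The five-species posets in the example are hypothetical and were chosen so that the first row has maximum value 3 in the s_2 column; the helper names `strict_pairs` and `sibling_matrix` are illustrative.

```python
def strict_pairs(levels):
    """Expand a levelled poset (list of sets, closest first) into pairs a < b."""
    return {
        (a, b)
        for i, low in enumerate(levels)
        for high in levels[i + 1:]
        for a in low
        for b in high
    }


def sibling_matrix(posets, species):
    """A[i][j] = number of species x with s_j < x in the s_i-poset.

    posets: dict species -> set of strict pairs. Diagonal entries are None.
    """
    A = []
    for si in species:
        rel = posets[si]
        row = [
            None if sj == si else sum(1 for (a, _) in rel if a == sj)
            for sj in species
        ]
        A.append(row)
    return A


species = ["s1", "s2", "s3", "s4", "s5"]
posets = {
    "s1": strict_pairs([{"s1"}, {"s2"}, {"s3"}, {"s4", "s5"}]),
    "s2": strict_pairs([{"s2"}, {"s1"}, {"s3"}, {"s4", "s5"}]),
    "s3": strict_pairs([{"s3"}, {"s1", "s2"}, {"s4", "s5"}]),
    "s4": strict_pairs([{"s4"}, {"s5"}, {"s1", "s2", "s3"}]),
    "s5": strict_pairs([{"s5"}, {"s4"}, {"s1", "s2", "s3"}]),
}
A = sibling_matrix(posets, species)
```

For these posets the resulting matrix is symmetric, as Theorem 2 states for consistent sets of posets.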
To follow is an example to illustrate the data structures used in tree construction. The matrix shown in Table 2 is constructed for the posets in Fig 10. The graph in Fig 11 shows the cliques that represent siblings indicated by matrix A in Table 2.
The first row of matrix A indicates that s 2 is a sibling of s 1 . The maximum value in the s 1 row is 3, which is in the s 2 column, and it is the only column with this value. This is also clear in the graph shown in Fig 11. Since the maximum value found in the s 1 row is 3, and it is only under the s 2 column, therefore, s 2 is the only sibling of s 1 . Similarly, s 4 and s 5 are also siblings.

PLOS ONE
The algorithm starts by the procedure of inferring siblings by detecting cliques in the graph. For each species, the algorithm scans the row corresponding to that species, and detects which species are connected using edges with equal labels. The detected species are all siblings. After detecting each set of siblings comes the updating step. In this step, the rows and columns of the siblings are merged. This procedure is repeated until only one species is remaining, which is the root.
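The procedure described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: siblings are taken to be the groups of nodes connected by edges carrying the current maximum label, each group is replaced by a new parent, and the merged row of a parent is copied from one representative member of the group (an assumption, since the merge rule is not spelled out in the text). Trees are returned as nested frozensets with species names at the leaves.

```python
from itertools import combinations


def build_tree(names, M):
    """Construct a tree from a symmetric matrix M (upper triangle is used)."""
    nodes = list(names)
    A = {
        frozenset({names[i], names[j]}): M[i][j]
        for i, j in combinations(range(len(names)), 2)
    }
    while len(nodes) > 1:
        top = max(A[frozenset({u, v})] for u, v in combinations(nodes, 2))
        # Group nodes that are joined by edges with the maximum label.
        group = {u: {u} for u in nodes}
        for u, v in combinations(nodes, 2):
            if A[frozenset({u, v})] == top and group[u] is not group[v]:
                merged = group[u] | group[v]
                for w in merged:
                    group[w] = merged
        # Replace every group of siblings by a new parent node.
        new_nodes, rep, seen = [], {}, set()
        for u in nodes:
            if u in seen:
                continue
            seen |= group[u]
            parent = frozenset(group[u]) if len(group[u]) > 1 else u
            new_nodes.append(parent)
            rep[parent] = u  # representative whose row stands in for the parent
        A = {
            frozenset({p, q}): A[frozenset({rep[p], rep[q]})]
            for p, q in combinations(new_nodes, 2)
        }
        nodes = new_nodes
    return nodes[0]


# Hypothetical matrix: s1, s2 are siblings, s3 joins their parent, s4, s5 are siblings.
M = [
    [None, 3, 2, 0, 0],
    [3, None, 2, 0, 0],
    [2, 2, None, 0, 0],
    [0, 0, 0, None, 3],
    [0, 0, 0, 3, None],
]
tree = build_tree(["s1", "s2", "s3", "s4", "s5"], M)
```

On this input the sketch first merges {s1, s2} and {s4, s5} at label 3, then joins s3 to the parent of s1 and s2 at label 2, and finally creates the root when all remaining labels are 0.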
After scanning the s 1 row, the matrix A is reduced as shown in Table 3. Similarly, the matrix A is reduced after detecting the siblings s 4 and s 5 , as shown in Table 4.

This procedure is repeated, but this time the highest integer is 2, therefore, s 3 is a sibling of s 12 , the parent of s 1 and s 2 . And, the new matrix is shown in Table 5.
The final step creates one root for the remaining species because all the values are 0, hence, all the remaining species are at the same level. The tree reconstructed from the posets in Fig 10  is shown in Fig 12. Another example to further illustrate the algorithm uses the set of posets P in Fig 13. The matrix in Table 6 is constructed for the set of posets P in Fig 13.

The largest integer is 3, and it indicates that s_1, s_2, and s_3 are siblings, as well as s_4, s_5, and s_6. The matrix then becomes as shown in Table 7. Therefore, one root is created for the remaining two nodes to construct the tree in Fig 14.
The following example illustrates how the algorithm constructs an S-tree from a set of posets P = {P_1, P_2, ..., P_n}.
Consider a set of species S = {s_1, s_2, s_3, s_4, s_5} with the set of posets P in Fig 15. The corresponding A matrix is shown in Table 8. The maximum is 3, with the siblings s_1 and s_2, as well as s_3 and s_4. The matrix A then becomes as shown in Table 9.
Fig 11. Graph for matrix A in Table 2. https://doi.org/10.1371/journal.pone.0281824.g011

Now, s_5 is a sibling of both s_12 and s_34, giving one root for the three nodes. The constructed tree is shown in Fig 16.
The algorithm also uses a subroutine to find cliques with equal edge labels. The subroutine scans the matrix A to find a clique with maximum edge labels. The subroutine AddSiblings shows the steps for adding the vertices that belong to a certain clique as siblings in the tree T. The subroutine also reduces the graph by merging the rows and columns in the matrix A.
[Table 7. Updated matrix A for posets in Fig 13.]
Proof. To prove the theorem, we use induction on the number of species. Let the number of species be n. For n = 1 and n = 2, there is no maximum value in the matrix A; hence, the tree is trivial. For n = 3, there are three possibilities for the third species s_3: it is a sibling of s_1 and s_2, a sibling of their parent, or a sibling of only one of them. The algorithm checks the values in the A matrix: if A(1, 3) = A(2, 3) = A(1, 2), then s_3 is a sibling of s_1 and s_2; otherwise, s_3 is a sibling of their parent. If s_1 and s_2 are not siblings, then the values in the A matrix will detect s_3 as a sibling of one of them, which is the third possibility. After detecting siblings, the matrix A is reduced by eliminating the siblings and replacing them by their parent. Therefore, for n species, the algorithm scans the matrix A, and at each step the siblings are eliminated and replaced by their parent; this reduces the matrix A until only one species remains, which is the root.

Generating a set of posets from a given S-tree
For each tree T, there exists a set of posets P compatible with T. In this section, we show how given a tree T, the set of compatible posets can be generated.

A set of posets P is compatible with an S-tree T if, for all distinct triples x, y, z ∈ L(T) such that λ(x) = s_i, λ(y) = s_j, and λ(z) = s_k and such that s_j ≤_i s_k, the shortest path from either of x or y to z passes through MRCA(x, y). Therefore, the procedure of obtaining posets from a tree is straightforward. Given a tree T, it is clear which species are closer to each other than others, and hence, posets can be generated. By obtaining the path from each species (leaf node) to the root of the tree, and laying this path horizontally, we get the nodes sorted in order of closeness to this specific leaf node. Each node on the path represents a subtree, of which the leaves belonging to the species set represent one level of the poset.
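The path-based procedure can be sketched in a few lines. The code below is an illustration of the idea rather than the GeneratePosets algorithm itself: a tree is a nested tuple with species names at the leaves, and for each leaf the subtrees met on the way to the root each contribute one level, namely their leaves that have not been placed on a lower level. The function names are illustrative.

```python
def leaves(t):
    """Set of species names below the (sub)tree t."""
    if isinstance(t, str):
        return {t}
    return set().union(*(leaves(c) for c in t))


def path_to_root(t, leaf):
    """Subtrees on the path from `leaf` up to the root t, leaf first."""
    if t == leaf:
        return [t]
    if isinstance(t, str):
        return None
    for child in t:
        path = path_to_root(child, leaf)
        if path is not None:
            return path + [t]
    return None


def generate_posets(tree):
    """For each species, the levels of its poset, closest level first."""
    posets = {}
    for s in sorted(leaves(tree)):
        seen, levels = set(), []
        for sub in path_to_root(tree, s):
            new = leaves(sub) - seen   # species first reached at this node
            levels.append(new)
            seen |= new
        posets[s] = levels
    return posets


# Hypothetical tree: ((s1, s2), s3) and (s4, s5) are the children of the root.
example_tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
example_posets = generate_posets(example_tree)
```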
An example to illustrate how posets are generated from a tree is shown in Fig 22. The tree on the right shows the path from s_1 to the root, where each node on the path is the root of a subtree, and the leaves belonging to each subtree represent a level of the poset P_1. The subtree with the root s_1 has only one leaf, and that is s_1. The second level of the poset contains the leaves in the subtree rooted at the next node on the path, excluding s_1.
Proof. Using a proof by construction, we show that the algorithm GeneratePosets generates the set of posets P compatible with a given tree T. From the definition of compatible in the Definitions section, we know that an s_i-poset P_i = (S, ≤_i) is compatible with S-tree T if, for all distinct triples x, y, z ∈ L(T) such that λ(x) = s_i, λ(y) = s_j, and λ(z) = s_k and such that s_j ≤_i s_k, the shortest path from either of x or y to z passes through MRCA(x, y). The algorithm GeneratePosets finds, for a species s_i, the path p from s_i to the root r. The nodes that come first on the path p are closer to s_i and, hence, come at a lower level in the poset. That follows from the definition of compatible, which indicates that if s_j ≤_i s_k, then the shortest path from either of x or y to z passes through MRCA(x, y). Therefore, by scanning the path p, the set of posets P can be constructed.

Theorem 6. The algorithm GeneratePosets has a time complexity of O(n^3).
Proof. Let the number of species be n. The loop on line 2 iterates n times, and on line 3, finding the path from a certain species to the root is also linear in the number of species, this gives

Relating posets to trees
The following theorems relate posets and trees to one another.
Theorem 7. Given a set of posets P, if there exists an S-tree T that P is compatible with, then T can be used to generate the same set of posets P.
Proof. Given a set of posets P, assume that P is compatible with a tree T. Assume that T, in turn, generates a different set of posets P′. P′ can now be used to construct a tree T′ that is compatible with P′; T′ is expected to be equivalent to T. However, since P′ and P are not equal, the two trees constructed are also not the same. Since T and T′ are different, T and T′ can yield contradictory 2-partitions; this means that T and T′ may be contradictory trees, and hence, one of them cannot be used to give the same set of posets. Hence, there is a contradiction, and T cannot be used to generate a set of posets other than P.
Theorem 8. Let 𝒫_1 and 𝒫_2 be two sets of posets that are compatible with the two S-trees T_1 and T_2, respectively. Then T_1 and T_2 are contradictory if and only if there exist a poset P_i ∈ 𝒫_1 and a poset P_j ∈ 𝒫_2 such that P_i is inconsistent with P_j.
Proof. First, we prove that if T_1 and T_2 are contradictory, then there exist a poset P_i ∈ 𝒫_1 and a poset P_j ∈ 𝒫_2 such that P_i is inconsistent with P_j. Using a proof by contradiction, assume that T_1 and T_2 are contradictory and that there are no posets P_i ∈ 𝒫_1 and P_j ∈ 𝒫_2 such that P_i is inconsistent with P_j. Since T_1 and T_2 are contradictory, there exist an edge in T_1 and an edge in T_2 that, when cut, induce contradictory 2-partitions. This means that there exist four species s_1, s_2, s_3, and s_4 such that s_1 and s_2 belong to the same partition in one tree but not in the other. Similarly, s_3 and s_4 belong to the same partition in one tree but not in the other. Since the set of posets 𝒫_1 is compatible with T_1 and the set of posets 𝒫_2 is compatible with T_2, and since T_1 and T_2 are contradictory, there exist a poset P_i ∈ 𝒫_1 and a poset P_j ∈ 𝒫_2 such that P_i is inconsistent with P_j. This contradicts the assumption.
The second part of the proof shows that if there exist a poset P_i ∈ 𝒫_1 and a poset P_j ∈ 𝒫_2 such that P_i is inconsistent with P_j, then T_1 and T_2 are contradictory. Using a proof by contradiction, assume that there exist a poset P_i ∈ 𝒫_1 and a poset P_j ∈ 𝒫_2 such that P_i is inconsistent with P_j while T_1 and T_2 are non-contradictory. If P_i ∈ 𝒫_1 is inconsistent with P_j ∈ 𝒫_2, then 𝒫_1 is inconsistent with the set of posets 𝒫_2; hence, the two sets of posets can create contradictory 2-partitions in their corresponding trees, and therefore, the trees that are compatible with both sets of posets cannot be non-contradictory. This contradicts the assumption. Therefore, the theorem follows.
The poset P_1 ∈ 𝒫_1 indicates that s_2 is a sibling of s_1, while the poset P_1 ∈ 𝒫_2 indicates that s_3 is a sibling of s_1. Therefore, the two posets are inconsistent.

Refinement of trees
We start with a basic result about refinement (Theorem 9).

Lemma 1. Let T be an S-tree. Let Q be the 2-partition set of T. Then Q is not contradictory with itself.
Proof. We show that every pair of 2-partitions in Q is non-contradictory. Consider an arbitrary pair of distinct edges of T. This pair of edges are the ends of a unique path in T. Let u_0, u_1, ..., u_{k−1}, u_k be that path. Then the edges are (u_0, u_1) and (u_{k−1}, u_k). These edges partition S into three sets: X, the set of species reachable from u_0 without using (u_0, u_1); Y, the set of species reachable from u_k without using (u_{k−1}, u_k); and Z, the set of species reachable from u_1, u_2, ..., u_{k−1} without using (u_0, u_1) or (u_{k−1}, u_k). The 2-partition corresponding to (u_0, u_1) is (X, Y ∪ Z), and the 2-partition corresponding to (u_{k−1}, u_k) is (X ∪ Z, Y). Recall the definition of contradictory 2-partitions: two 2-partitions X = (X_1, X_2) and Y = (Y_1, Y_2) are contradictory partitions if there exist four species s_1, s_2, s_3, s_4 such that s_1, s_2 ∈ X_1, s_3, s_4 ∈ X_2, s_1, s_3 ∈ Y_1, and s_2, s_4 ∈ Y_2. Let s_1, s_2, s_3, s_4 ∈ S. If s_1, s_2 ∈ X and s_3, s_4 ∈ Y ∪ Z, then s_1, s_2 ∈ X ∪ Z, so the definition does not apply to the 2-partitions corresponding to (u_0, u_1) and (u_{k−1}, u_k). Since the two edges were arbitrary, we conclude that Q is not contradictory with itself.
Lemma 2. Let T_1 be an S-tree, and let T_2 be a refinement of T_1. Let Q_1 be the 2-partition set of T_1, and let Q_2 be the 2-partition set of T_2. Then Q_1 ⊆ Q_2.
Proof. A refinement step adds one edge to T 1 and one 2-partition. By induction on the number of refinement steps to go from T 1 to T 2 , we obtain Q 1 ⊆ Q 2 .
Theorem 9. If S-tree T 2 can be obtained from S-tree T 1 using a number of refinement steps, then T 1 and T 2 are non-contradictory.

PLOS ONE
Proof. Let T 1 be an S-tree, and let T 2 be a refinement of T 1 . Let Q 1 be the set of 2-partitions of T 1 , and let Q 2 be the set of 2-partitions of T 2 . By Lemma 2, Q 1 ⊆ Q 2 . By Lemma 1, Q 2 is not contradictory with itself. Then Q 1 and Q 2 are non-contradictory, since otherwise Q 2 would be contradictory with itself. By definition, T 1 and T 2 are non-contradictory.
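The definition of contradictory 2-partitions recalled in the proof of Lemma 1 can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it assumes a 2-partition is given as a pair of Python sets, and the function names `contradictory` and `self_contradictory` as well as the sets X, Y, Z are hypothetical. It relies on the observation that the four witness species s 1 , . . ., s 4 exist exactly when every side of one 2-partition intersects every side of the other.

```python
from itertools import combinations

def contradictory(p, q):
    """True if 2-partitions p = (X1, X2) and q = (Y1, Y2) are contradictory.

    The witnesses s1..s4 of the definition exist exactly when each side
    of p shares at least one species with each side of q.
    """
    (x1, x2), (y1, y2) = p, q
    return all(a & b for a in (x1, x2) for b in (y1, y2))

def self_contradictory(partitions):
    """True if some pair of 2-partitions in the collection is contradictory."""
    return any(contradictory(p, q) for p, q in combinations(partitions, 2))

# The situation of Lemma 1 with hypothetical sets X, Y, Z.
X, Y, Z = {"s1", "s2"}, {"s4", "s5"}, {"s3"}
edge_a = (X, Y | Z)   # 2-partition of the edge (u0, u1)
edge_b = (X | Z, Y)   # 2-partition of the edge (u_{k-1}, u_k)
print(contradictory(edge_a, edge_b))

# A pair that does satisfy the definition (hypothetical species).
clash_a = ({"s1", "s2"}, {"s3", "s4"})
clash_b = ({"s1", "s3"}, {"s2", "s4"})
print(contradictory(clash_a, clash_b))
```

The first call prints False, because X and Y are disjoint, which is the observation used in Lemma 1. The second prints True.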
The posets given for each gene are used to construct one tree per gene. These trees can contain contradictory information, as illustrated in Fig 24. To identify HGT events, contradictory trees must be identified. This is done by examining the ways in which the leaves and the root of a tree can be partitioned, that is, by examining the cuts at edges that are not incident to leaf nodes. If two trees are contradictory, then there is evidence for HGT.
The minimum common refinement of two non-contradictory S-trees T 1 and T 2 is an S-tree T 3 that is a common refinement of T 1 and T 2 such that any other common refinement of T 1 and T 2 is a refinement of T 3 .
Theorem 10. Let T 1 and T 2 be S-trees that are non-contradictory. Let Q 1 and Q 2 be their respective sets of 2-partitions. Then there exists a unique tree T 3 that is their minimum common refinement. Furthermore, if Q 3 is the set of 2-partitions of T 3 , then Q 3 = Q 1 ∪ Q 2 .
Proof. Define Q 3 = Q 1 ∪ Q 2 . Therefore, Q 3 contains 2-partitions, where each 2-partition is obtained by cutting one edge of the tree T 3 . Hence, the set Q 3 can be used to construct the tree T 3 by checking each 2-partition, starting with the 2-partition of minimum cardinality. Siblings in T 3 are inferred and the set is reduced. This process is repeated until the only remaining 2-partitions are those in which one element has cardinality one. Since Q 3 = Q 1 ∪ Q 2 , and since Q 1 and Q 2 each already correspond to a tree, all the 2-partitions in Q 1 and Q 2 already correspond to edges in a tree. Therefore, using the two sets, a more refined tree can be constructed. Since Q 1 and Q 2 both contain non-contradictory partitions, and since Q 3 = Q 1 ∪ Q 2 , Q 3 also contains non-contradictory partitions, and hence there exists a tree T 3 that corresponds to Q 3 . Using induction, we start with Q 1 and T 1 and add 2-partitions from Q 2 to Q 1 . Let k be the number of 2-partitions added. If k = 1, then a single 2-partition is added from Q 2 to Q 1 . Since T 1 and T 2 are non-contradictory, a 2-partition that exists in Q 2 but not in Q 1 only adds an internal node and an edge to T 1 . Therefore, T 1 becomes a more refined tree. Hence, adding k 2-partitions to T 1 will further refine T 1 by adding more edges and internal nodes. Therefore, given Q 3 , a set of non-contradictory 2-partitions, a tree T 3 can be constructed.
An algorithm for finding the minimum common refinement of T 1 and T 2 is shown in Fig 25. The algorithm finds all 2-partitions of T 1 and T 2 . A 2-partition is found by cutting an edge of the tree and finding the leaves in the two induced subtrees. For example, cutting an edge (i, j) induces two subtrees, one with the root i and the other with the root j. Performing a depth-first search on the two subtrees finds the leaves in both subtrees. The species set of each subtree forms one side of the 2-partition; that is, S(i) forms one side and S(j) forms the other.
The subroutine FindTwoPartitions shown in Fig 26 finds the 2-partition set for a given tree. When the 2-partition sets have been found for both trees, a union is performed on these sets to obtain the 2-partition set of the minimum common refinement tree.
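The edge-cutting step and the union step described above can be sketched as follows. This is an illustration under stated assumptions rather than the algorithms of Fig 25 and Fig 26: a tree is assumed to be an undirected edge list whose degree-one nodes are the species, the two input trees are assumed to be already known to be non-contradictory, and the names `adjacency`, `leaves_from`, `find_two_partitions` and `min_common_refinement_partitions`, along with the example trees, are hypothetical.

```python
def adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def leaves_from(adj, start, blocked):
    """Depth-first search from `start` that never enters `blocked`.

    In a tree this visits exactly the subtree on the `start` side of the
    cut edge (start, blocked). Returns the leaf (species) labels found.
    """
    seen, stack, found = {start, blocked}, [start], set()
    while stack:
        node = stack.pop()
        if len(adj[node]) == 1:          # degree one: a leaf, i.e. a species
            found.add(node)
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(found)

def find_two_partitions(adj):
    """Cut every edge once and collect the induced 2-partitions."""
    parts = set()
    for i in adj:
        for j in adj[i]:
            if i < j:                    # visit each undirected edge once
                sides = (leaves_from(adj, i, j), leaves_from(adj, j, i))
                parts.add(frozenset(sides))
    return parts

def min_common_refinement_partitions(adj1, adj2):
    """Union of the two 2-partition sets (trees assumed non-contradictory)."""
    return find_two_partitions(adj1) | find_two_partitions(adj2)

# Two hypothetical S-trees on the species s1..s5 with internal nodes a..d.
t1 = adjacency([("a", "s1"), ("a", "s2"), ("a", "b"),
                ("b", "s3"), ("b", "s4"), ("b", "s5")])
t2 = adjacency([("c", "s1"), ("c", "s2"), ("c", "s3"),
                ("c", "d"), ("d", "s4"), ("d", "s5")])
q3 = min_common_refinement_partitions(t1, t2)
print(len(q3))
```

Each tree contributes five 2-partitions with a single species on one side and one further 2-partition, so the union holds seven 2-partitions. Running one search per edge and side mirrors the description in the text; a single post-order traversal would be an alternative design that avoids the repeated searches.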
The algorithm that constructs a tree from its two-partition set is shown in Fig 27, followed by an illustrative example.
As an example of minimum common refinement, consider two S-trees T 1 and T 2 . If both trees can be refined into a third S-tree T 3 using a number of refinement steps, then it is guaranteed that the two trees carry non-contradictory information. For example, the two S-trees T 1 and T 2 shown in Fig 28 are non-contradictory and are both refined into T 3 . In this example, T 3 is obtained using the minimum number of refinement steps; hence, T 3 is the minimum common refinement of T 1 and T 2 .

Fig 29 shows an example to illustrate minimum common refinement, where the tree T 3 is the minimum common refinement of the two trees T 1 and T 2 . Here T 3 is obtained using one refinement step, which is performed on T 1 by adding a parent for s 3 and s 4 . The refined tree is the same tree as T 2 . Fig 30 shows an example to illustrate the algorithm. The node s 0 is added under the root to avoid having equivalent sets for a 2-partition, as these equivalent sets disappear when

Let us consider the following two-partition set, Q, to illustrate the algorithm.
Q 1 = {s 0 }, {s 1 , s 2 , s 3 , s 4 , s 5 }
Q 2 = {s 1 }, {s 0 , s 2 , s 3 , s 4 , s 5 }
Q 3 = {s 2 }, {s 0 , s 1 , s 3 , s 4 , s 5 }
Q 4 = {s 3 }, {s 0 , s 1 , s 2 , s 4 , s 5 }
Q 5 = {s 4 }, {s 0 , s 1 , s 2 , s 3 , s 5 }
Q 6 = {s 5 }, {s 0 , s 1 , s 2 , s 3 , s 4 }
Q 7 = {s 1 , s 2 }, {s 0 , s 3 , s 4 , s 5 }
Q 8 = {s 1 , s 2 , s 3 }, {s 0 , s 4 , s 5 }
Q 9 = {s 0 , s 1 , s 2 , s 3 }, {s 4 , s 5 }
The algorithm starts by removing all 2-partitions that contain a set of cardinality 1. So the set Q is reduced to the following:
Q 7 = {s 1 , s 2 }, {s 0 , s 3 , s 4 , s 5 }
Q 8 = {s 1 , s 2 , s 3 }, {s 0 , s 4 , s 5 }
Q 9 = {s 0 , s 1 , s 2 , s 3 }, {s 4 , s 5 }
The set with the minimum cardinality is in Q 7 ; therefore, the species s 1 and s 2 are detected as siblings and they are replaced by a parent node in all sets. Therefore, Q is modified to the following:

Proof. Let n be the number of species. Let m be the number of edges in a tree T. The subroutine FindTwoPartitions on Lines 3 and 4 is O(mn). Line 5 performs a union operation linear in the number of species. Line 6 constructs the tree from its two-partition set; ConstructTree2-Partitions is O(n²). Therefore, the overall complexity of the algorithm MinCommonRefine is O(mn + n²).
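The worked example with the 2-partitions Q 1 to Q 9 can be replayed in code. The sketch below follows the steps described in the text (drop 2-partitions that contain a set of cardinality 1, take a set of minimum cardinality, declare its members siblings, replace them by a parent node in all sets), but it is not the algorithm of Fig 27. Two choices are assumptions made here: only sets that do not contain the root marker s 0 are considered when choosing siblings, and whatever remains unattached at the end is placed under a node called root. The names `construct_tree`, `p1`, `p2`, . . . and `root` are hypothetical.

```python
def construct_tree(partitions, root_marker="s0"):
    """Rebuild a rooted tree, as {parent: children}, from 2-partitions."""
    parts = [(set(a), set(b)) for a, b in partitions]
    labels = set().union(*(a | b for a, b in parts))
    children = {}
    while True:
        # Drop 2-partitions in which one side has shrunk to a single node.
        parts = [(a, b) for a, b in parts if len(a) > 1 and len(b) > 1]
        if not parts:
            break
        # Assumption: siblings are taken from the side without the root marker.
        clades = [s for p in parts for s in p if root_marker not in s]
        group = set(min(clades, key=len))
        parent = "p%d" % (len(children) + 1)
        children[parent] = sorted(group)
        # Replace the sibling group by its new parent node in every set.
        for a, b in parts:
            for side in (a, b):
                if group <= side:
                    side -= group
                    side.add(parent)
    attached = {c for kids in children.values() for c in kids}
    children["root"] = sorted((labels | set(children)) - attached)
    return children

def sp(*idx):
    """Helper: build a set of species labels from indices."""
    return {"s%d" % i for i in idx}

Q = [
    (sp(0), sp(1, 2, 3, 4, 5)), (sp(1), sp(0, 2, 3, 4, 5)),
    (sp(2), sp(0, 1, 3, 4, 5)), (sp(3), sp(0, 1, 2, 4, 5)),
    (sp(4), sp(0, 1, 2, 3, 5)), (sp(5), sp(0, 1, 2, 3, 4)),
    (sp(1, 2), sp(0, 3, 4, 5)),        # Q7
    (sp(1, 2, 3), sp(0, 4, 5)),        # Q8
    (sp(0, 1, 2, 3), sp(4, 5)),        # Q9
]
tree = construct_tree(Q)
for parent, kids in tree.items():
    print(parent, kids)
```

Under these assumptions the first merge makes s 1 and s 2 siblings under p1, as in the text; p1 and s 3 are then joined under p2, s 4 and s 5 under p3, and the nodes p2, p3 and s 0 end up under root.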

Inferring HGT from posets
In this section, we show how posets and trees are used to infer HGT.
The problem is defined as follows:

The process of identifying which genes are candidates for HGT proceeds as follows. Two S-trees T 1 and T 2 are tested for contradiction. If they are contradictory, they are placed in two different sets; otherwise, they are placed in the same set. The process continues with the remaining trees: if the next tree to be tested is T 3 , it is compared with one tree from each set to determine to which set T 3 belongs. It is expected that the majority of the trees will be non-contradictory, with some trees contradicting this majority, so there will be one set with a higher cardinality. Therefore, the trees in the other sets, which are the minority, are considered candidates for HGT.
The algorithm performs ideally when all the trees are completely refined (binary) trees, where the trees that are not identical are considered contradictory. In what follows, some real-life HGT examples are given to support the argument that the genes involved in HGT are a minority and that there will always be a dominant tree. Ponting [41] indicates that only 0.5% of all human genes were copied into the genome from bacteria by HGT. Rujan and Martin [42] analyzed how many genes in Arabidopsis come from cyanobacteria. They used a sample of 3961 Arabidopsis nuclear protein-coding genes and compared those with the complete set of proteins from yeast and 17 reference prokaryotic genomes, including one cyanobacterium. In their analysis of 386 phylogenetic trees, they found that the number of genes horizontally transferred to Arabidopsis from cyanobacteria falls between approximately 400 genes and approximately 2200 genes, that is, between 1.6% and 9.2% of nuclear genes.
The algorithm InferHGT is shown in Fig 33. The input to the algorithm is a set of trees T = {T 1 , T 2 , . . ., T n }, where n is the number of trees and also the number of genes.
An example to illustrate the algorithm for inferring HGT is shown in Fig 34, where the trees T 1 , T 2 , and T 3 are non-contradictory, while the tree T 4 contradicts the three trees. In T 4 there is a 2-partition that places the two species {s 1 , s 3 } in one partition and {s 2 , s 4 } in the other. This 2-partition contradicts the other three trees. Therefore, the gene corresponding to T 4 is a candidate for HGT, where a horizontal transfer occurred between s 1 and s 3 , or between s 2 and s 4 . The network in Fig 1 shows the possible horizontal transfers. We note that the figure documents not only the existence of two possible horizontal transfers but also their directionality, which is especially valuable for any further investigation.
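The grouping procedure described above can be sketched in a few lines. This is a hedged illustration rather than the InferHGT algorithm of Fig 33: each tree is assumed to be given by its nontrivial 2-partitions, each new tree is compared with the first tree of every existing set, and the genes outside the largest set are reported. The function names, the gene labels g1 to g4, and the concrete 2-partitions of T 1 , T 2 and T 3 are hypothetical; only the 2-partition {s 1 , s 3 }, {s 2 , s 4 } of T 4 is taken from the example.

```python
def contradictory(p, q):
    """True if each side of 2-partition p meets each side of 2-partition q."""
    (x1, x2), (y1, y2) = p, q
    return all(a & b for a in (x1, x2) for b in (y1, y2))

def trees_contradict(q1, q2):
    """Two trees contradict if some pair of their 2-partitions does."""
    return any(contradictory(p, q) for p in q1 for q in q2)

def infer_hgt(trees):
    """Group genes by tree compatibility; return genes outside the majority.

    `trees` maps a gene name to the list of 2-partitions of its tree and
    is assumed to be non-empty.
    """
    groups = []
    for gene, parts in trees.items():
        for group in groups:
            representative = trees[group[0]]
            if not trees_contradict(parts, representative):
                group.append(gene)
                break
        else:
            groups.append([gene])        # contradicts every existing set
    majority = max(groups, key=len)
    return [g for group in groups if group is not majority for g in group]

# Hypothetical input: three genes agree on {s1, s2} | {s3, s4},
# while the tree of g4 contains {s1, s3} | {s2, s4}.
common = [({"s1", "s2"}, {"s3", "s4"})]
trees = {
    "g1": common,
    "g2": common,
    "g3": common,
    "g4": [({"s1", "s3"}, {"s2", "s4"})],
}
print(infer_hgt(trees))
```

For this input the sketch reports g4 as the only candidate. Comparing against a single representative per set matches the description in the text and is exact when the trees are fully refined; for partially refined trees a stricter variant could compare against every member of a set.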

Conclusions
We have introduced the theoretical problem of inferring HGT using partial orders, where there is one poset per gene per species. These posets are used to construct S-trees for the corresponding genes, one tree for each gene. These trees are then compared, and the trees that contradict the majority of trees correspond to genes that are candidates for HGT. An algorithm for identifying contradiction is presented and then used in the algorithm to infer HGT. The concept of refinement is also presented in this paper, and it can also be used to identify contradiction among trees. An algorithm for finding a minimum common refinement of two trees is also presented. This algorithm finds the union of the 2-partition sets of two trees and then uses this set to construct a third tree, which is their minimum common refinement. Several points can be studied further. For example, more work could be done to find solutions to the problem of incorrect or missing data in the input posets. This will be very challenging, but, from a practical viewpoint, it would be most valuable. Another point is to develop algorithms that use the refinement of trees for identifying contradictory trees, since two contradictory trees do not have a common refinement.